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Abstract 

Bayesian learning is often hampered by large computational expense. As a powerful 
generalization of popular belief propagation, expectation propagation (EP) efficiently 
approximates the exact Bayesian computation. Nevertheless, EP can be sensitive to 
O outliers and suffer from divergence for difficult cases. To address this issue, we pro- 

pose a new approximate inference approach, relaxed expectation propagation (REP). It 
CN relaxes the moment matching requirement of expectation propagation by adding a re- 

laxation factor into the KL minimization. We penalize this relaxation with a l\ penalty. 
V.Q With this penalty, when two distributions in the relaxed KL divergence are similar, 

we obtain the exact moment matching; in the presence of outliers, the relaxation fac- 
tor will used to relax the moment matching constraint. Based on this penalized KL 
minimization, REP is robust to outliers and can greatly improve the posterior approx- 
imation quality over EP. To examine the effectiveness of REP, we apply it to Gaussian 
1 process classification, a task known to be suitable to EP. Our classification results on 

J> synthetic and UCI benchmark datasets demonstrate significant improvement of REP 

over EP and Power EP — in terms of algorithmic stability, estimation accuracy and 
predictive performance. 

Keywords: Approximate Bayesian inference, Relaxed moment matching, Expectation 
propagation, Z x penalty, Gaussian process classification 
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1 Introduction 

Bayesian learning provides a principled framework for modeling complex systems and mak- 
ing predictions. A critical component of Bayesian learning is the computation of posterior 

*This work was sponsored by the NSF grants IIS-0916443, IIS-1054903, ECCS-0941533, and CCF- 
0939370. All the authors gratefully acknowledge the support of the grants. Any opinions, findings, and 
conclusion or recommendation expressed in this material are those of the author (s) and do not necessarily 
reflect the view of the funding agencies or the U.S. government. 
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distributions that represent estimation uncertainty. However, the exact computation is often 
so expensive that it has become a bottleneck for practical applications of Bayesian learning. 
To address this challenge, a variety of approximate inference methods has been developed to 
speed up the computation [Jaakkola, 2000, Minka, 2001, Opper and Winther, 2005, Wain- 
wright and Jordan, 2008]. As a representative approximate inference method, expectation 
propagation [Minka, 2001] generalizes the popular belief propagation algorithm, allows us 
to use structured approximations and handles both discrete and continuous posterior distri- 
butions. EP has been shown to significantly reduce computational cost while maintaining 
high approximation accuracy; for example, Kuss and Rasmussen [2005] have demonstrated 
that, for Gaussian process (GP) classification, EP can provide accurate approximation to 
predictive posteriors. 

Despite its success in many applications, EP can be sensitive to outliers in observation and 
suffer from divergence when the exact distribution is not close to the approximating family 
used by EP. This stems from the fact that EP approximates each factor in the model by a 
simpler form, known as messages, and iteratively refines the messages (See Section 2). Each 
message refinement is based on moment matching, which minimizes the Kullback-Leibler 
(KL) divergence between old and new beliefs. The messages are refined in a distributed 
fashion — resulting in efficient inference on a graphical model. But when the approximating 
family cannot fit the exact posterior well — such as in the presence of outliers — the message 
passing algorithm can suffer from divergence and give poor approximation quality. 

We can force EP to converge by using the CCCP algorithm [Yuille, 2002, Heskes et al., 
2005]. But it is slower than the message passing updates. Also, according to Minka [2001], 
EP diverges for a good reason — indicating a poor approximating family or a poor energy 
function used by EP. 

To address this issue, we propose a new approximate inference algorithm, Relaxed Expec- 
tation Propagation (REP). In REP, we introduce a relaxation factor r in the KL minimization 
used by EP (See Section 3) and penalize this relaxation factor. Because of this penaliza- 
tion, when the factor involved in the KL minimization is close to the current approximation, 
REP reduces to EP; when the factor is an outlier, the relaxation is used to stabilizing the 
message passing by relaxing the moment matching constraint. Regardless of the amount of 
outliers in data, REP converges in all of our experiments. To better understand REP, we 
also present the primal energy functions in Section 3. It differs from the EP energy function 
or the equivalent Bethe-like energy function [Heskes et al., 2005] by the use of relaxation 
factors. 

To examine the performance of REP, in Section 5, we use it to train Gaussian process 
classification models for which EP is known to be a good choice for approximate inference 
[Kuss and Rasmussen, 2005]. In Section 7, we report experimental results on synthetic and 
UCI benchmark datasets, demonstrating that REP consistently outperforms EP and Power 
EP — in terms of algorithmic stability, estimation accuracy, and predictive performance. 
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2 Background: Expectation Propagation 



Given observations T>, the posterior distribution of a probabilistic model with factors {^(w)} 
is 

P(w|2>) = i J] ^ w )' C 1 ) 

0=1,...,7V 

where Z is the normalization constant. Note that the prior distribution over w is the factor 
to in the above equation and a factor tj(w) may link to one, several, or all variables in w. In 
general, we do not have a closed-form solution for the posterior calculation. We could use 
random sampling methods — such as the Metropolis Hasting — to obtain the posterior distri- 
bution, but these methods can suffer on slow convergence, especially for high dimensional 
problems. 

To reduce the computational cost, Minka [2001] proposed EP to approximate the poste- 
rior distribution p(w\D) by q(w) via factor approximation: 

<?(w)=n^ w ) (2) 

i 

where i$(w) approximates £j(w) and has a simpler tractable form. EP requires both q(w) and 
the approximation factor £j(w) have the form of the exponential family — such as Gaussian 
or factorized (or some structured) discrete distributions. The approximation factors are 
unnormalized, but given them, we can also easily obtain the natural parameters of the 
approximate posterior q(w) due to the log linear property of the exponential family. For a 
graphical model representation, we can interpret the approximation factor ij(w) as a message 
from the i th exact factor tj(w) to the variables linked to it. 

To find the approximate posterior q, after initializing all the messages as one, EP iter- 
atively refines the messages by repeating the following three steps: message deletion, belief 
projection, and message update, on each factor. In the message deletion step, we com- 
pute the partial posterior g^(w) by removing a message U from the approximate posterior 
g old (w): g\*(w) oc g old (w)/tj(w). In the projection step, we minimize the KL divergence 
between pi(w) oc tj(w)g^(w) and the new approximate posterior q(w), such that the infor- 
mation from each factor is incorporated into g(w). Finally, the message ti is updated via 
U(w) oc g(w)/g^(w). 

Since g(w) is in the exponential family, it has the following form 

g(w) oc exp(i/ T 0(w)) 

where <^(w) are the features of the exponential family. Given this representation, the KL 
minimization in the key projection step is achieved by moment matching: 

J 0(w)j3j(w)dw = J 0(w)q(w)dw (3) 

This KL minimization distributed on each factor works very well, when the data is 
relatively clean and the approximate posterior q is not too far from pi. However, in practice, 
the presence of outliers can ruin the distributed KL minimization and leads to divergence of 
the algorithm. 
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3 Relaxed Expectation Propagation 



In this section, we first present the new relaxed expectation propagation framework, discuss 
the choice of relaxation factors, and then describe its primal energy functions. 



3.1 The REP Algorithm 

To reduce the impact of outlier factors, we can introduce relaxation factor ri(w) oc exp(r]J <p(w)) 
into the KL divergence. And to avoid too much relaxation we use a l\ penality over it: 

KL r {piri\\qri) + clrjilx (4) 

over q and rj, where |r/^i is the l\ norm of 77^, the weight c controls how much relaxation we 
have, and the KL r divergence is defined for unnormalized distributions. 

This replacement allows us to adaptively handle factors — whether it is an outlier or not, 
and accurately approximate the posterior distribution p(w|P) (1) by q(w) cx rii^( w )- 

With this relaxed KL divergence, we obtain the following REP algorithm: 

1. Initialize g(w) as the prior to( w ) (assuming the prior is in the exponential family) and 
all the messages tj(w) = 1 for i = 1, . . . , N. 

2. Loop until convergence or reaching the maximal number of iterations. 

• Loop over factor i — 1, . . . , N: 

(a) Message deletion: Based on the current factor ti and g old , calculate the 
partial belief 

ocq old (w)/U(w). 

(b) Belief projection: Incorporate information from the exact factor t t into the 
new belief q by minimizing the penalized KL: 

vcnnKL r {Uriq\ l \\qri) + c\'q i \ 1 (5) 

n,g 

where Pi(w) = tj(w)rj(w)g^(w). 

(c) Message update: Update the message based on the new belief: 

ti(w) cx q(w)/q v (w). 

Unlike EP, REP does not require strict moment matching between Pi(w) oc tj(w)g^(w) 
and the new approximate posterior g(w). How close these moments are depends on how big 
r] i is in the l\ penalized relaxation factor r^. 



3.2 Choice of relaxation factors 

For the relaxation factors rj(w) = expf^^w)), we should parameterize r\ i in a form to 
make the minimization of (5) easy. Clearly, there are many choices available for us. A 
convenient one is to set (part of) rj i to be a scaled version of the parameters of an old 
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message ti, which can damp the influence of outliers via relaxed moment matching, but it 
will not cause double-counting of factors. The reason is that appears in both sides of (5) 
and the new posterior q does not include r,. With this choice, we can use moment matching 
to easily obtain an analytical solution for the product of q and r i: greatly simplifying the 
joint optimization over q and V{. This makes the computational overhead of REP over EP 
negligible in practice. 

If we choose a form of that makes the joint minimization over and (i.e., belief) q 
expensive, we can still use a sequential minimization procedure: first minimize the penalized 
KL to obtain based on the current q; and then, based on the estimated relaxation factor, 
minimize the relaxed KL to obtain the new q. 



3.3 Energy function 

Now we give the primal and dual energy functions for relaxed expectation propagation. The 
primal energy function is 



mm max 



53 i / Pi (w)ri(w) log- 
g(w) 

Z q p(w) 



rum i ZiJ w Ziti(w)p(w) 

1 f <?( w ) 

■(n-1)— / q w )n w log— — - + c > 7?i 6 

Z q iw Zqp(w) V 



subject to 



If If 

w)p i (w)r i (w)dw = — / 0(w)g(w)rj(w)dw (7) 



where f w p~j(w)dw = 1, J w g(w)dw = I, Z { = f w p i (w)r i (w)dw, and Z g = f yf q(w)r i (w)dw. 

Based on the KL duality bound, we obtain the dual form of the energy function (See 
the Appendix for details). Setting the gradient of the dual function to zero gives us the 
fixed-point updates described in the previous section. The fixed-point updates, however, do 
not guarantee convergence, just like the classical EP updates. However, REP is much more 
robust than EP; in our experiments while EP diverges on difficult datasets, REP does not 
diverge in our experiments once. 

We believe the robustness of REP comes from the relaxation of moment matching in 
(7): it does not demand the moments of pi and q to be exactly matched as in EP. Given an 
outlier factor, the exact moment matching requires the current q moves dramatically to a 
new q, ignoring all the information from the previous factors, summarized in the current q. 
And this can cause oscillations, reducing the final approximation accuracy. 

From an optimization perspective, the min-max cost function (6) includes the cost func- 
tion of EP as a special case by setting ri(w) = 1. By tuning ri(w), it is possible to find a 
better solution to the min-max optimization. As shown by Heskes et al. [2005], the cost func- 
tion of EP corresponds to the Bethe energy, an entropy approximation, with exact moment 
matching constraints. With relaxed moment matching, we can potentially obtain better 
entropy approximation (We will further our research along this line in the future). 

Finally we want to stress that by REP robustifies EP to obtain an more accurate posterior 
approximation, rather than ignoring information from outliers, as shown in figure 1. 
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4 REP training for Gaussian process classification 



In this section, we present a REP-based training algorithm for Gaussian process classification. 
First, let us denote N independent and identically distributed samples as V = {(x^ yi)}^Li, 
where x« is a d dimensional input and yi is a scalar output. We assume there is a latent 
function / that we are modeling and the noisy realization of latent function / at Xj is 

We use a GP prior with zero mean over the latent function /. Its projection at the 
samples {x,} defines a joint Gaussian distribution: p(f) = J\f(f\0,K) where = /c(xj,Xj) 
is the covariance function, which encodes the prior notation of smoothness. For classification, 
the data likelihood has the following form 

p( yi \f) = (1 - e)Q(f(x i )y i ) + ee(-/(x^) (8) 

where e models the labeling error, and 0(a) = 1 when a > (0(a) = otherwise). 
Given the GP prior over / and the data likelihood, the posterior process is 

N 

p(f\V) «GP(/|0,lOnP(!*l/(*i)) ( 9 ) 

i=i 

Due to the nonlinearity in p(yi\f), the posterior process does not have a closed-form solution. 

Using REP, we approximate each non Gaussian factor p(?/j|/(xj)) by a Gaussian factor 
U(fi) = J\f(fi\mi,Vi). Then we obtain a Gaussian process approximation to (9): 

N 

p{f\V,t) <xGP{f\^K)\[M{fi\mi,Vi) (10) 

i=i 

We parameterize the relaxation factor 7*j as an Gaussian: 

nifi) ocJVX/iKA), (11) 

so that Ti share the mean as U and 6j is the only free parameter in r^. For the convenience 
of the following presentation, we define Uj,{fi) = ^{fi\m% t bjVifi) cx ri(fi)ti(fi). Now we give 
the relaxed EP algorithm for training a GP classifier. 

1. Initialize rrii — 0, Vi — oo, and hi = for ti. Also, initialize rj, hi — 0, A = K, and 
Aj = Kjj. 

2. Until all (m,i,Vi, hi) converge: Loop i — 1, . . . , N: 

(a) Remove ti from the approximated posterior: 

\Y = (J- - -)- 1 ^ = a, + A^VC^ - (12) 

(b) Minimize the relaxed KL divergence over hi (i.e., r«) by line search (See the Ap- 
pendix) . 
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(c) Multiple q\ l with r«: 



K V = 1/(1/ \Y + b t ) h>' = hf - X^XihY - tth) (13) 

(d) Minimize the relaxed KL divergence to obtain ti >b : 

1 (1 - 2e)Af(z\0,l) - ~\i ~\i 

a= , —ft 1 \, / hi = hi X +Xi X a 14 

fj\i e+(l-2e)if>(z) 



~ \i 1 

Ui,b = A, (— ~- - 1) m ij6 = hi + v i>b a (15) 



where z = hi /y A, and is the standard normal cumulative density distri- 
bution. 

(e) Remove r, from tj ^ to obtain tf. 

Vi = l/{l/v iib + h) mi = Vi(m i:b /vi :b + mf A bi) (16) 

(f) Update A and hf 

j j 

where 5 = 1/(1 /vi — l/f° ld ) and aj is the i-th column of A. 
We will release our software implementation upon the publication. 

5 Related works 

Minka [2005] proposed Power EP (PEP) via the use of the a-divergence [Zhu and Rohwer, 
1995]. The framework includes EP, fractional Belief propagation [Wiegerinck and Heskes, 
2002] , and variational Bayes as special cases, each of which is associated with a particular 
value a in the a divergence. In the presence of outliers, by using a power smaller than one 
for factors, Power EP increases the algorithmic stability over EP. But it also changes the 
divergence used for minimization to an a-divergence that is different from KL, the desired 
divergence for many problems (e.g,. classification). In contrast, REP adaptively relaxes the 
KL minimization for individual factors only when it becomes necessary. 

We can damp the step size for message updates to help convergence, as suggested in 
[Minka, 2004]. But for difficult cases, we need to use a very small step size, greatly reducing 
the convergence speed. Furthermore, damping does not guarantee convergence. As a result, 
without using any stepsize, our approach is a good alternative to fix EP for difficult cases. 



7 



6 Experiments 



In this section, we compare EP, PEP, and REP on approximation accuracy, convergence 
speed, and prediction accuracy for on Gaussian process classification. We chose GP classi- 
fication as the test bed because EP has shown to be an excellent choice for approximation 
inference with GP classification models [Kuss and Rasmussen, 2005]. For EP, we used the 
updates described in Chapter 5.4 of the Thesis of Minka [2001]. Since there is no previous 
work that uses PEP for training GP, we derived the updates and described them in the 
Appendix. The reason we compared REP with PEP is because PEP can also help stabilize 
EP, possibly improving the approximation quality. 

6.1 Evaluation of posterior approximation accuracy 

First, we considered linear classification of five data points shown in Figure 1. The red 
'x' and blue 'o' data points belong two classes. The red point on the right is mislabeled. 
To reflect the true labeling error rate in the data, we set e = 0.2 in (8). To obtain linear 
classifiers, we use the linear kernel — fc(xj,Xj) = x^Xj for the three algorithms. After the 
algorithms converge, we can obtain the posterior mean and covariance of a linear classifier 
w in the 2-dimensional input space. 

To measure the approximation quality, we first used importance sampling with 10 s sam- 
ples to obtain the exact posterior distribution of the classifier w. We then applied these 
algorithms to obtain the approximate posteriors. We treated the (approximate) posterior 
means as the estimated classifiers and used them to generate their decision boundaries. They 
are visualized in Figure l.a. For PEP, we set the power u to 0.8; for REP, we set c = 20. 
Given the outlier on the right, the EP decision boundary significantly differs from the ex- 
act Bayesian decision boundary obtained from the importance sampling; the PEP decision 
boundary is closer to the exact one; and the REP decision boundary overlaps with the exact 
one perfectly. 

We also varied the relaxation weight c in (5) for REP and the power for PEP to examine 
their impact on approximate quality. We measured the mean square distances between the 
estimated and the exact mean vectors; we also computed the mean square distances between 
the estimated and the exact covariance matrices. The results are summarized in Figure l.b 
to I.e. For PEP, as shown in l.b to l.c, although the decision boundaries appear to be more 
aligned with the exact posterior distribution, their estimated mean and covariance are always 
worse than what EP achieve. This suggests that although PEP does reduce the influence 
of the outlier, it does not provide better approximation. By contrast, for REP, when c is 
big, the l\ penalty forces the relaxation factor = (i.e., = 1) and, accordingly, REP 
reduces to EP and gives the same results; And when c is relatively smaller (for a wide range 
of values), REP not only is immune to the presence of the outlier, but also improves the the 
approximation quality significantly. 

Finally, we emphasize that REP aims to provide an accurate posterior approximation, 
regardless of likelihoods we used. For example, with various values of e (e.g., 0.1 and 0.25) 
in (8), REP consistently provides more accurately results than EP and PEP. 



-EP --PEP —REP --Monte Carlo 




Figure 1: Classification of five data points, among which the red data point on the right is 
mislabeled, (a): Decision boundaries of EP, Power EP, and Relaxed EP; (b) and (c): EP 
vs Power EP with different powers u; (d) and (e): EP vs Relaxed EP with different penalty 
weights c. REP reduces to EP when c is big. For a wide range of c values, the REP's 
approximation accuracy is significantly higher than those of EP and Power EP. 



6.2 Results on synthetic data 

We then compared these algorithms on a nonlinear classification task. We sampled 200 data 
points for each class: for class 1 the points were sampled from a single Gaussian distribution 
and, for class 2, the points from a mixture of two Gaussian components. The data points are 
represented by red crosses and blue circles for the two classes (See Figure 2). We randomly 
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(c) GP-REP: 10 iterations 

Figure 2: Decision boundaries of EP, Power EP, and REP. 20% of the data points are 
mislabeled. 

flipped the labels of some data points to introduce labeling errors; we varied the error rates 
from 10% to 20%. And for each case, we let e match the error rate. We used a Gaussian 
kernel for all these training algorithms and applied cross-validation on the training data to 
tune the kernel width. We also tuned the relaxation weight c for REP and the power for 
PEP. 

In Figure 2, we visualized the decision boundaries EP, PEP, and REP on one of the 
datasets with 20% labeling errors. To obtain these results, we set the power u = 0.8 for 
Power EP and c = 10 for Relaxed EP. Clearly, EP diverges and leads to a chaotic decision 
boundary. PEP converges in 20 iterations and gives a decision boundary — better than that 
of EP but with strange shapes. Finally, REP converges in only 10 iterations and provides a 
much more reasonable decision boundary than PEP. 

To illustrate the convergence of PEP and REP, we visualized in Figure 3 the change of 
the GP parameter a along iterations: R(iter) = \\a iter — a^ ej ._ 1 || 2 . Clearly, PEP and REP 
are stabler than EP whose estimates oscillate — reflected by pikes in the R curve. 
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Figure 3: Change in GP parameters along iterations. 



We repeated the experiments 10 times; each time we sampled 400 training and 39,600 
test points. Figure 4 summarizes the results. Figure 4. a shows that the number of iterations 
before convergence. The results are averaged over 10 runs. To reach the convergence, we 
required R < 10~ 3 . Clearly, REP converges faster than PEP and EP. 

Figure 4.b shows that while EP and PEP can diverge (PEP diverges less frequently than 
EP), REP always converges. Figure 4.c shows that REP gives significantly higher prediction 
accuracies than EP and PEP. Note that here we did not randomly flip the labels to introduce 
labeling errors in the test data and the prediction errors can be lower than the labeling errors 
in the training sets. 

6.3 Results on real data 

Finally we tested these algorithms on five UCI benchmark datasets: Heart, Pima, Diabetes, 
Haberman, and Spam. 

For the Heart dataset, the task is to detect heart diseases with 13 features per sample. 
We randomly split the dataset into 81 training and 189 test samples 20 times. For the 
Pima dataset, we randomly split it into 319 training and 213 test samples, again 20 times. 
For the Diabetes dataset, medical measurements and personal history are used to predict 
whether a patient is diabetic. Ratsch et al. [2001] split the UCI Diabetes dataset into two 
groups (468 training and 300 test samples) for 100 times. We used the same partitions in 
our experiments. For the Haberman's survival dataset, the task is to estimate whether the 
patient survive more than five years (including 5 years) after a surgery for breast cancer. The 
whole dataset contains information from 306 patient samples and 3 attributes per sample. 
We randomly split the dataset into 183 training and 123 test samples 100 times. Note that we 
did not add any labeling errors to these four datasets. Figure 5 summarizes the results. The 
prediction accuracies of GP-EP and GP-REP are averaged over the splits of each dataset. 
REP outperforms the competing algorithms significantly. 

For the Spam dataset, the task is to detect spam emails. We partitioned the dataset to 
have 276 training and 4325 test samples, and flipped the labels of randomly selected data 
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Figure 4: Comparison of EP, Power EP, and Relaxed EP on two datasets with different 
labeling noise levels. Relaxed EP always converges. And with fewer iterations, Relaxed EP 
consistently achieves higher prediction accuracies than EP and Power EP. 
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Figure 5: Test error rates of EP, PEP and REP on four UCI benchmark datasets without 
additional labeling noise. 
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Figure 6: Test error rates of EP, PEP and REP on Spam dataset. We flipped the labels of 
some randomly selected data points to examine how these algorithms perform with outliers. 



points from both the training and test samples. The experiment was repeated for 100 times. 
Figure 6 demonstrated that, with various additional labeling error rates, REP consistently 
achieves higher prediction accuracies than both EP and PEP. 



7 Conclusions 

In the paper we have introduced a method to increase the stability and approximation quality 
of EP. We relax the moment matching requirement of EP with a l\ penalty. Experimental 
results on GP classification demonstrate that the new inference algorithm avoids divergence 
and gives higher prediction accuracy than EP and Power EP. 
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Appendices 

A Primal and dual energy functions for relaxed EP 

The primary energy function of relaxed EP is 




(18) 



subject to 




(22) 



(19) 



(20) 



(21) 



(23) 



r;(w) oc exp(T7f0(w)) 



(24) 
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where c is the constant and is the relaxation factor. 

Based on the KL duality bound, we obtain the dual energy function. 

minmin max(n — 1) log / p(w) exp(i/ T 0(w) + r]J(j)(w))dw 
n v \ J w 

n „ 

_ 5Z l0g / *t(w)f>(w)exp(Af0(w) + r ?f ( / , ( W )) rfw + C $^l r7i l ] 
i=l ^ w i 



(25) 



n 



Setting the gradient of the above function to zero gives us the fixed-point updates described 
in the Section 3 of the main text. The fixed-point updates, however, do not guarantee 
convergence. But because of the relaxed KL minimization, REP always converges in our 
experiments (while EP can diverge when given many outliers). 

Now we prove the duality of the relaxed EP energy function. Applying the KL duality 
to the first term in (6)produces 

I [ p i (w)r i (w)log P ; {W) (27) 

Pi(w)ri(w) log 



Zi J w Z i t i (w)p(w)r i (w) 
= max-!- / p i (-w)r i (w)X i (w)dw - log / tj(w)p(w)rj(w) exp(A;(w))dw 

This is because the maximum of the right side of (27) is achieved when (taking derivative 

tO A,;(w)) 

i^(w)n(w) - t (w) P (w)r (w)exp(A (w)) = 

^ 1 ' 1 ' r w t,(w)p(w)r,(w)exp(A l (w))dw 1 J 

which means 

f\ t w ft( w ) r i( w )/ w ^( w M w ) r i( w )exp(A i (w))rfw 

exp(A,(w)) = (29) 

ZA(w)p(w)ri(w) 

Inserting exp(Aj(w)) in (27) proves the KL duality for (27). 
And from the stationary condition, we can assume w.l.o.g. that 

Ai(w) = Af0(w) (30) 

I [ ft(w)r,(w) log t/tt-t (31) 
= max— / pi(w)rj(w)A^ 0(w)c/w — log / tj(w)p(w)rj(w) exp(Af 0(w))<fvv 

^ Zj J *w J w 
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Similarly, we have 



g(w)ri(w)log 



^ g iw zT 9 p(w)r;(w) 



= mm 



^- A i/(w)g(w)rj(w)dw + log / p(w)rj(w) exp(i/(w))dw 
\- I v T <fi(w)q(w)ri(w)dw + log / p(w)rj(w) exp(i/ T 0(w))rfw 



With the constraint ((n — = ^ Aj) and (2), we obtain the dual energy function: 
minminmax(n — 1) log / p(w)rj(w) exp(i^ T 0(w))rfw 

" " A Jw 

-J^log / t i (w)p(w)r i (w)exp(Af0(w))rfw + c^|r7 i | 1 (33) 
i=i >/w » 

(n-l)i/ = J]A i (34) 



B Relaxed KL for GP classification 

For GP classification, we minimize the relaxed KL divergence with li penalty over foj by line 
search. Here we present how to compute the value of this cost function: 

Q(b i ) = KL r (t i r i q\ i \\r i q) + c\b i \ (35) 

Following the notations in the main text (from equations (16) to (23)), we have Q(bi) as 

i {[(! - e ) ^Sl 1 - e) - e log e] tp{z) + e log e} + ^(i 7 ^ - hm^) 
-J log f 1 + (h + — )A, V ) + I log&AV + 1) - ^,(m? - 2m 4 ft t + F i>6 ) 

+ ^7v , - lQ g^ + c N (36) 
where Zi = e + (1 — 2e)'0(z), and the term can be computed as follows: 

s * - <4 - \r <37) 

Or?'}- 



n new 



~al™ = al ew (l -r) (39) 



ii 

r 2 



= «rr + ^ (40) 

Using the above equations, we can efficiently optimize Qipi) over bi via line search. 
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C Power EP for GP classification 

In this section, we describe how to train GP classifiers by Power EP. The updates of Power EP 
are the same as equations (5.64) to (5.74) in [Minka, 2001], except two critical modifications: 

• Replace equation (5.67) in [Minka, 2001] by 

1 {(l- e y-e-\N{z\U,l) 
1 v ^e« + [(l-e)«-e«]^(z) 1 ] 

where is the standard normal cumulative density function and u is the power used 
by Power EP. 

• Moreover, after (5.70), scale V{ by u: 

Vi 4- uvi (42) 
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